Abstract

We use convex relaxation techniques to produce lower bounds on the optimal value of subset 
selection problems and generate good approximate solutions. We then explicitly bound the 
quality of these relaxations by studying the approximation ratio of sparse eigenvalue relaxations. 
Our results are used to improve the performance of branch-and-bound algorithms to produce 
exact solutions to subset selection problems. 

1 Introduction 

We focus here on the subset selection problem, i.e., solving least squares regressions while
constraining the number of nonzero regression variables to be less than a certain target. This problem
is often called feature selection or sparse least-squares. Its combinatorial nature makes subset
selection intractable. Several techniques have however been derived to produce good approximate
solutions, using for example greedy algorithms or sparsity inducing penalties.

Given a design matrix $X \in \mathbb{R}^{n \times p}$ and a response vector $y \in \mathbb{R}^n$, we consider the following
subset selection problem

\[
\begin{array}{ll}
\mbox{minimize} & \|y - Xw\|_2^2 \\
\mbox{subject to} & \mathbf{Card}(w) \leq k,
\end{array}
\tag{1}
\]

in the variable $w \in \mathbb{R}^p$, where $k$ is a parameter controlling sparsity. It was shown in Natarajan
(1995) that while (1) is NP-Hard, simple greedy algorithms can efficiently produce good approximate
solutions. Subset selection can also be understood as $\ell_0$ norm constrained regression (or
approximation), and a very large body of work has focused on replacing the combinatorial $\ell_0$ norm
with a convex $\ell_1$ norm constraint, with $\ell_1$ norm regression usually known as LASSO (Tibshirani
(1996)).

Explicit variable selection consistency results have been derived in certain regimes (see
Meinshausen et al. (2007)), and recent results (Donoho and Tanner (2005); Candes and Tao
(2007)) have shown that under certain conditions on the design matrix $X$, the solutions of the
$\ell_1$ problem coincide with those of the $\ell_0$ problem. Several authors have attacked the $\ell_0$ problem
directly, with Narendra and Fukunaga (1977); Hand (1981); Furnival and Wilson Jr (2000);
Moghaddam et al. (2008) using branch-and-bound techniques to produce exact solutions to problem
(1), with Moghaddam et al. (2008) in particular using interlacing properties of eigenvalues to



*INRIA - WILLOW Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure (CNRS/ENS/INRIA
UMR 8548), 23, avenue d'Italie, 75214 Paris, France. francis.bach@mines.org
†ORFE, Princeton University, Princeton, NJ 08544. sahipasa@princeton.edu
‡ORFE, Princeton University, Princeton, NJ 08544. aspremon@princeton.edu






speed up branch-and-bound methods. Solving the $\ell_0$ problem in (1), even for small values of $p$, has
direct applications in image denoising (Elad and Aharon (2006); Mairal et al. (2008)).



All the algorithms listed above produce good approximate solutions, hence upper bounds on the
optimal value of the subset selection problem (1). Our first contribution here is to use convex
relaxation techniques to produce lower bounds on the optimal value of (1). In particular, this result
allows us to bound the suboptimality of approximate solutions and improve the performance of
branch-and-bound algorithms for subset selection. We also use randomization techniques to generate
good solutions to (1), often improving on solutions produced by greedy or LASSO algorithms.
Our next main contribution is to derive approximation bounds on the performance of the sparse
eigenvalue relaxation/randomization algorithm. Finally, we test our algorithms on various subset
selection problems and show that the lower bound derived here considerably reduces the number
of branches required to produce an optimal solution to (1).

The paper is organized as follows. In Section 2, we show how to produce lower bounds on
the optimal value of problem (1) using relaxation bounds on sparse eigenvalues. In Section 3, we
describe greedy, randomization and branch-and-bound algorithms to generate good approximate
solutions $w$ to (1) using the output of the relaxation. In Section 4, we produce a bound on the
approximation ratio of sparse eigenvalues, thus bounding the quality of the approximation of the
subset selection bounds derived in Section 2. Section 5 shows how to efficiently solve our semidefinite
relaxation using first-order methods. Finally, Section 6 presents some numerical experiments.

Notations 

Given matrices $X, Y \in \mathbb{R}^{n \times p}$, we write $X \circ Y$ their Schur (componentwise) product, while $\lambda_{\max}(X)$
is the leading eigenvalue of $X$, $\|X\|_1$ the sum of absolute values of the coefficients in $X$ and $X_i$
is the $i^{th}$ column of $X$. We let $\mathbf{S}_p$ be the set of symmetric matrices, and for $Y \in \mathbf{S}_p$, we write
$\mathbf{diag}(Y) \in \mathbb{R}^p$ its diagonal. When $y \in \mathbb{R}^p$, $\mathbf{diag}(y) \in \mathbf{S}_p$ denotes the diagonal matrix with diagonal
coefficients equal to the coefficients of $y$, while $\mathbf{Card}(y)$ is the number of nonzero coefficients in $y$.

2 Relaxation & Lower Bounds



Following d'Aspremont et al. (2008) for example, we first recall how solving problem (1) is
equivalent to computing sparse eigenvalues of a matrix formed using $X$ and $y$. We let $\psi(k)$ be the optimal
value of the subset selection problem (1), with

\[
\begin{array}{rll}
\psi(k) = & \mbox{min.} & \|y - Xw\|_2^2 \\
& \mbox{s.t.} & \mathbf{Card}(w) \leq k,
\end{array}
\tag{2}
\]

in the variable $w \in \mathbb{R}^p$, where $k$ is again a parameter controlling sparsity. We can rewrite this

\[
\begin{array}{rcl}
\psi(k) & = & \displaystyle \min_{u \in \{0,1\}^p,\ \mathbf{1}^T u \leq k}\ \min_{w \in \mathbb{R}^p}\ \|y - X \mathbf{diag}(u) w\|_2^2 \\
& = & \displaystyle \min_{u \in \{0,1\}^p,\ \mathbf{1}^T u \leq k}\ \min_{\|w\|_2 = 1}\ \min_{v \in \mathbb{R}}\ \|y - v X \mathbf{diag}(u) w\|_2^2
\end{array}
\]

which, after minimizing explicitly over $v$, becomes



\[
\psi(k) = \min_{u \in \{0,1\}^p,\ \mathbf{1}^T u \leq k}\ \min_{\|w\|_2 = 1}\ y^T y - \frac{(y^T X \mathbf{diag}(u) w)^2}{w^T \mathbf{diag}(u) X^T X \mathbf{diag}(u) w}.
\]
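The explicit minimization over $v$ is a one-dimensional least-squares problem. As a short worked step, write $a = X \mathbf{diag}(u) w$ (a shorthand introduced here only for this derivation) and assume $a \neq 0$:

```latex
\[
\|y - v a\|_2^2 = y^T y - 2 v\, a^T y + v^2\, a^T a ,
\qquad
v^\star = \frac{a^T y}{a^T a},
\qquad
\min_{v \in \mathbb{R}} \|y - v a\|_2^2 = y^T y - \frac{(a^T y)^2}{a^T a}.
\]
```

Substituting $a = X \mathbf{diag}(u) w$ gives the ratio $(y^T X \mathbf{diag}(u) w)^2 / (w^T \mathbf{diag}(u) X^T X \mathbf{diag}(u) w)$ used in the rest of this section.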

This means that $\psi(k) \geq y^T y - \rho$ if and only if

\[
\max_{u \in \{0,1\}^p,\ \mathbf{1}^T u \leq k}\ \max_{\|w\|_2 = 1}\ \frac{(y^T X \mathbf{diag}(u) w)^2}{w^T \mathbf{diag}(u) X^T X \mathbf{diag}(u) w} \leq \rho.
\]

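We can rewrite this condition

\[
w^T X^T (y y^T - \rho I) X w \leq 0, \quad \mbox{whenever } \|w\|_2 \leq 1,\ \mathbf{Card}(w) \leq k,
\]

which is equivalent to

\[
\lambda^k_{\max}(X^T (y y^T - \rho I) X) \leq 0. \tag{3}
\]

Here, $\lambda^k_{\max}(A)$ is the $k$-sparse maximum eigenvalue of a matrix $A \in \mathbf{S}_p$, defined as

\[
\begin{array}{rll}
\lambda^k_{\max}(A) = & \mbox{max.} & x^T A x \\
& \mbox{s.t.} & \|x\|_2 = 1,\ \mathbf{Card}(x) \leq k
\end{array}
\tag{4}
\]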



in the variable $x \in \mathbb{R}^p$. Relaxation bounds for sparse eigenvalues $\lambda^k_{\max}(A)$ were derived in
d'Aspremont et al. (2007, 2008), with the bound in d'Aspremont et al. (2007) written

\[
\begin{array}{rll}
\lambda^k_{\max}(A) \leq & \mbox{maximize} & \mathbf{Tr}\, AZ \\
& \mbox{subject to} & \|Z\|_1 \leq k \\
& & \mathbf{Tr}(Z) = 1,\ Z \succeq 0
\end{array}
\tag{5}
\]

in the variable $Z \in \mathbf{S}_p$. We can summarize the above derivation in the following proposition.

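Proposition 1 Given a design matrix $X \in \mathbb{R}^{n \times p}$ and a response vector $y \in \mathbb{R}^n$, consider the
following subset selection problem

\[
\begin{array}{rll}
\psi(k) = & \mbox{min.} & \|y - Xw\|_2^2 \\
& \mbox{s.t.} & \mathbf{Card}(w) \leq k,
\end{array}
\]

then

\[
\psi(k) \geq y^T y - \rho \quad \mbox{if and only if} \quad \lambda^k_{\max}(X^T (y y^T - \rho I) X) \leq 0,
\]

where $\lambda^k_{\max}(\cdot)$ is the sparse maximum eigenvalue function defined in (4).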
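The equivalence in Proposition 1 can be checked numerically on a tiny instance. The sketch below is only an illustration: it computes both $\psi(k)$ and the sparse eigenvalue by brute-force enumeration of supports (which is exactly what the relaxation is meant to avoid), and the function names, problem sizes and tolerance are assumptions made here, not part of the paper.

```python
import itertools
import numpy as np

def subset_selection_value(X, y, k):
    """psi(k): smallest squared residual over all supports of size k (brute force)."""
    best = float(y @ y)
    for I in itertools.combinations(range(X.shape[1]), k):
        XI = X[:, list(I)]
        w, *_ = np.linalg.lstsq(XI, y, rcond=None)
        r = y - XI @ w
        best = min(best, float(r @ r))
    return best

def sparse_max_eig(A, k):
    """k-sparse maximum eigenvalue of a symmetric A, by enumerating k x k principal submatrices."""
    return max(
        float(np.linalg.eigvalsh(A[np.ix_(I, I)])[-1])
        for I in itertools.combinations(range(A.shape[0]), k)
    )

def certificate(X, y, rho, k, tol=1e-9):
    """True when lambda^k_max(X^T (y y^T - rho I) X) <= 0 (up to tol)."""
    M = X.T @ (np.outer(y, y) - rho * np.eye(len(y))) @ X
    return bool(sparse_max_eig(M, k) <= tol)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))
y = rng.standard_normal(8)
k = 2
psi = subset_selection_value(X, y, k)
rho_star = float(y @ y) - psi   # threshold value of rho for this instance
```

With these definitions, `certificate(X, y, rho, k)` should hold for `rho` slightly above `rho_star` and fail slightly below it, which is the statement $\psi(k) \geq y^T y - \rho$ read in both directions.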



3 Approximate Solutions 

The relaxation detailed in (5) produces a lower bound on the objective value of problem (1). In
this section, we describe how to use the solution of this relaxation to produce good approximate
solution vectors $w$ to problem (1), hence produce upper bounds on the solution value. We first
describe greedy algorithms which can be used to solve problem (1) independently, or to improve
solutions extracted from convex relaxations.

3.1 Greedy methods



To simplify notations here, we first define the following function, which computes the solution value
of problem (1) given the support of the solution vector $w \in \mathbb{R}^p$. Let $I \subset [1,p]$ be an index subset
such that $w_i = 0$ if $i \notin I$, we write $I^c$ its complement in $[1,p]$ and let

\[
\mu(I) = \min_{w_{I^c} = 0} \|y - Xw\|_2^2 \tag{6}
\]

in the variable $w \in \mathbb{R}^p$. Note that while computing the optimal value of problem (1) is NP-Hard,
computing $\mu(I)$ in (6) is equivalent to forming a QR decomposition of the matrix $X_I \in \mathbb{R}^{n \times k}$,
where $k$ is the cardinality of the support set $I$.
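As a minimal sketch of how $\mu(I)$ in (6) can be evaluated through a QR decomposition, the function below projects $y$ onto the column span of $X_I$. It assumes $X_I$ has full column rank; the name `mu` and the use of NumPy are choices made for this illustration.

```python
import numpy as np

def mu(X, y, support):
    """Squared residual of the least-squares fit of y on the columns in `support`.

    Assumes X[:, support] has full column rank, so that the reduced QR factor Q
    spans exactly the column space of X[:, support].
    """
    support = list(support)
    if not support:
        return float(y @ y)
    Q, _ = np.linalg.qr(X[:, support])   # reduced QR: Q is n x k with orthonormal columns
    r = y - Q @ (Q.T @ y)                # component of y orthogonal to span(X_I)
    return float(r @ r)
```

The cost is dominated by the QR factorization of an $n \times k$ matrix, which is why evaluating (6) for a fixed support is cheap compared with searching over supports.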

We can greedily construct approximate solutions to (1) by scanning variables at each iteration
to increase (or decrease) the size of the support. The forward greedy algorithm is detailed in
Algorithm 1. The backward greedy algorithm is similar but starts from the full support $[1,p]$ and
progressively removes variables.

Algorithm 1 Forward Greedy Algorithm.

Input: $X \in \mathbb{R}^{n \times p}$, $y \in \mathbb{R}^n$, target cardinality $k^{target}$.

1: Initialization: $I_0 = \emptyset$.

2: for $k = 1$ to $k^{target}$ do

3: Compute $i_k = \mathrm{argmin}_{i \notin I_{k-1}}\ \mu(I_{k-1} \cup \{i\})$.

4: Set $I_k = I_{k-1} \cup \{i_k\}$ and compute $w_k$ as the minimizer of $\mu(I_k)$ in (6).
5: end for

Output: Support sets $I_k$ for $w$ in problem (1), with $k = 1, \ldots, k^{target}$.
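The steps of Algorithm 1 can be sketched in a few lines. This is an illustrative implementation under simple assumptions (dense NumPy arrays, residuals recomputed from scratch at every step rather than updated incrementally); the names `mu` and `forward_greedy` are introduced here.

```python
import numpy as np

def mu(X, y, support):
    """Squared residual of the least-squares fit restricted to `support`, as in (6)."""
    if not support:
        return float(y @ y)
    XI = X[:, support]
    w, *_ = np.linalg.lstsq(XI, y, rcond=None)
    r = y - XI @ w
    return float(r @ r)

def forward_greedy(X, y, k_target):
    """Return the nested supports I_1, ..., I_{k_target} built by forward selection."""
    p = X.shape[1]
    support, path = [], []
    for _ in range(k_target):
        candidates = [i for i in range(p) if i not in support]
        # Add the variable whose inclusion gives the smallest residual.
        best = min(candidates, key=lambda i: mu(X, y, support + [i]))
        support = support + [best]
        path.append(list(support))
    return path
```

Each iteration evaluates $\mu$ on every remaining variable, so a production version would typically update a factorization instead of solving each least-squares problem independently.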



3.2 Randomization 



As in the MAXCUT relaxation by Goemans and Williamson (1995) for example, we can use the
matrix solution to the relaxation in (5) to generate good approximate solutions to problem (1).
The solution matrix $Z$ in (5) can be understood as a covariance matrix, and we use it to generate
Gaussian vectors $z \sim \mathcal{N}(0, Z)$. The $k$ indices corresponding to the $k$ largest magnitude coefficients
of the sample vectors $z$ then provide support sets $I$ corresponding to nonzero coefficients in $w$.
Given these support sets, one then solves for $\mu(I)$ in (6) to get upper bounds on the optimal value
of (1) and approximate solution vectors $w$.
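The sampling step described above can be sketched as follows. The code assumes a positive semidefinite matrix `Z` is already available (for instance from an SDP solver applied to (5), which is not shown); the function names and the default number of samples are choices made for this illustration.

```python
import numpy as np

def sample_supports(Z, k, n_samples, rng):
    """Draw z ~ N(0, Z) and keep, for each draw, the indices of the k largest |z_i|."""
    vals, vecs = np.linalg.eigh(Z)
    root = vecs * np.sqrt(np.clip(vals, 0.0, None))   # root @ root.T equals Z for PSD Z
    z = root @ rng.standard_normal((Z.shape[0], n_samples))
    idx = np.argsort(-np.abs(z), axis=0)[:k, :]
    return {tuple(sorted(int(i) for i in col)) for col in idx.T}

def randomized_support(X, y, Z, k, n_samples=50, seed=0):
    """Return the sampled support with the smallest residual mu(I), and that residual."""
    rng = np.random.default_rng(seed)

    def residual(I):
        XI = X[:, list(I)]
        w, *_ = np.linalg.lstsq(XI, y, rcond=None)
        r = y - XI @ w
        return float(r @ r)

    supports = sample_supports(Z, k, n_samples, rng)
    best = min(supports, key=residual)
    return best, residual(best)
```

The returned residual is an upper bound on the optimal value of (1), since it is attained by a feasible vector with at most $k$ nonzero coefficients.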

In the next section, we will also consider another much simpler randomization procedure whose
performance can be completely characterized. This second procedure does not require solving
relaxation (5), but simply computing a leading eigenvector $x$ of the matrix $A$ in (4). Good approximate
solutions $z \in \mathbb{R}^p$ to problem (4) are then randomly sampled with $z_i = 1/\sqrt{k}$ with probability
$p_i = k|x_i|/\|x\|_1$ and $z_i = 0$ otherwise. We then prune $z$ using a few backward greedy steps, whenever
$\mathbf{Card}(z) > k$. While the complexity of this procedure is much lower than that of the full greedy
algorithm, we will see in the next section that it produces solutions of comparable quality.
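A minimal sketch of this simpler sampling rule is given below. Clipping the probabilities at 1 is an assumption added here for the case where $k|x_i|/\|x\|_1$ exceeds 1, and the backward greedy pruning step is not included.

```python
import numpy as np

def sample_sparse_vector(x, k, rng):
    """z_i = 1/sqrt(k) with probability k|x_i|/||x||_1 (clipped to 1), and 0 otherwise."""
    probs = np.minimum(1.0, k * np.abs(x) / np.abs(x).sum())
    keep = rng.random(x.shape[0]) < probs
    return keep / np.sqrt(k)

rng = np.random.default_rng(0)
p, k = 50, 5
x = np.full(p, 1 / np.sqrt(p))   # a unit vector with uniform entries, for illustration
samples = np.array([sample_sparse_vector(x, k, rng) for _ in range(2000)])
mean_card = np.count_nonzero(samples, axis=1).mean()
```

When no probability is clipped, the expected cardinality of $z$ is $\sum_i p_i = k$, so `mean_card` should be close to `k` in this example.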



3.3 Branch-and-bound algorithm 

As in Furnival and Wilson Jr (2000); Moghaddam et al. (2008), we can develop a branch-and-bound
algorithm for finding optimal solutions to (1). Suppose we are looking for a vector in $\mathbb{R}^p$ with at
most $k$ non-zero components, we need to enumerate at most $\binom{p}{k}$ subsets to find the best one. We
start by dividing all possible subsets into two branches, one containing the first variable and one
which does not. We further branch each of these branches into two, one containing the second
variable and one not, etc. At each node of the search tree, we have a subproblem that excludes
certain variables (depending on branching decisions made so far). For each subproblem, we generate
lower bounds using Proposition 1 by solving relaxation (5), and upper bounds when there are exactly
$k$ variables left on the branch. We also generate upper bounds by applying a combination of the
greedy algorithms and randomization techniques described above to the solutions of the relaxed
problems. We fathom a node whose lower bound exceeds the best upper bound, since the
branches diverging from this node cannot contain a better solution than the best solution found so
far.
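The search tree and the fathoming rule can be sketched as below. To keep the example self-contained, the node lower bound is a simple stand-in, not the semidefinite bound of Proposition 1: it fits $y$ on every variable not yet excluded, which can only underestimate the best residual reachable from the node. The function names and the `lower_bound` hook are assumptions of this sketch.

```python
import itertools
import numpy as np

def residual(X, y, cols):
    """Squared least-squares residual using only the columns in `cols`."""
    if not cols:
        return float(y @ y)
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ w
    return float(r @ r)

def branch_and_bound(X, y, k, lower_bound=None):
    """Depth-first search over include/exclude decisions, one variable at a time."""
    p = X.shape[1]
    if lower_bound is None:
        # Stand-in bound: any support allowed at a node is a subset of
        # included + free, so this residual cannot exceed the node optimum.
        lower_bound = lambda included, free: residual(X, y, included + free)
    best = {"value": np.inf, "support": None, "nodes": 0}

    def visit(included, next_var):
        best["nodes"] += 1
        free = list(range(next_var, p))
        if len(included) == k or not free:
            value = residual(X, y, included)      # leaf: the support is fixed
            if value < best["value"]:
                best["value"], best["support"] = value, list(included)
            return
        if lower_bound(included, free) >= best["value"]:
            return                                # fathom this node
        visit(included + [next_var], next_var + 1)   # branch containing next_var
        visit(included, next_var + 1)                # branch excluding it

    visit([], 0)
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 8))
y = rng.standard_normal(12)
result = branch_and_bound(X, y, k=3)
brute = min(residual(X, y, list(I)) for I in itertools.combinations(range(8), 3))
```

A tighter bound, such as one derived from relaxation (5), would be passed through `lower_bound` and would prune more nodes; the stand-in only illustrates where the bound enters the search.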



4 Tightness 

The sparse eigenvalue problem in (4) is closely connected to the $k$-Dense-Subgraph problem
described in Kortsarz and Peleg (1993); Feige et al. (2001); Feige and Langberg (2001) for example.
The $k$-Dense-Subgraph problem seeks to find a principal submatrix of $A$ of dimension $k$ with largest
coefficient sum. This is written

\[
\max_{\mathbf{1}^T u \leq k}\ u^T A u
\]

in the variable $u \in \{0,1\}^p$. On the other hand, the problem of computing a sparse maximum
eigenvalue can be written

\[
\lambda^k_{\max}(A) = \max_{\mathbf{1}^T u \leq k}\ \max_{\|x\| = 1}\ u^T (A \circ x x^T) u
\]

in the variables $x \in \mathbb{R}^p$, $u \in \{0,1\}^p$. We thus observe that computing sparse eigenvalues means
solving a $k$-Dense-Subgraph problem over the result of an inner eigenvalue problem in $x$. Below, we
first recall an approximation result on the backward greedy algorithm used in Moghaddam et al.
(2008), which applies to positive semidefinite matrices $A$.



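Proposition 2 Let $A \in \mathbf{S}_p$, with $A \succeq 0$ and $k > 0$, and suppose $\mathbf{diag}(A) > 0$. We have

\[
\frac{k}{p}\, \lambda_{\max}(A) \leq \lambda^k_{\max}(A) \leq \lambda_{\max}(A) \tag{7}
\]

where $\lambda^k_{\max}(A)$ is the optimal value of problem (4).

Proof. From (Horn and Johnson, 1985, §4.3.14), when $A \succeq 0$, we have

\[
\lambda^i_{\max}(A) \geq \frac{i}{i+1}\, \lambda^{i+1}_{\max}(A)
\]

for any $i \in [1, p-1]$. A simple recursion then gives the desired result. ■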
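As a small sanity check of the $k/p$ bound for positive semidefinite matrices, the sketch below compares $(k/p)\lambda_{\max}(A)$, the sparse eigenvalue and $\lambda_{\max}(A)$ on a random instance. The sparse eigenvalue is computed by brute-force enumeration, which is only viable for very small $p$; the matrix size and seed are arbitrary choices made here.

```python
import itertools
import numpy as np

def sparse_max_eig(A, k):
    """k-sparse maximum eigenvalue of a symmetric A, by enumerating k x k principal submatrices."""
    return max(
        float(np.linalg.eigvalsh(A[np.ix_(I, I)])[-1])
        for I in itertools.combinations(range(A.shape[0]), k)
    )

rng = np.random.default_rng(1)
p = 7
B = rng.standard_normal((p, p))
A = B @ B.T                                   # random positive semidefinite matrix
lam_max = float(np.linalg.eigvalsh(A)[-1])
# One triple (k, lower bound, sparse eigenvalue) per cardinality.
bounds = [(k, k / p * lam_max, sparse_max_eig(A, k)) for k in range(1, p + 1)]
```

Each triple in `bounds` should satisfy lower bound $\leq$ sparse eigenvalue $\leq$ `lam_max`, with equality on the right when $k = p$.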

When $A$ is not positive semidefinite, we can adapt results from Feige and Seltser (1997) to show

\[
\frac{k(k-1)}{p(p-1)}\, \lambda_{\max}(A) \leq \lambda^k_{\max}(A) \leq \lambda_{\max}(A). \tag{8}
\]

When the coefficients of $A$ are nonnegative, we can obtain approximation bounds for basic
randomization techniques similar to those developed in Feige and Seltser (1997) for the $k$-Dense-Subgraph
problem. The approximation ratio in this case also decreases as $k/p$, which shows that the
randomization algorithm has a performance comparable to that of the backward greedy method,
while being significantly cheaper on large scale problems.

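Proposition 3 Let $A \in \mathbf{S}_p$ such that $A_{ij} \geq 0$, $i, j = 1, \ldots, p$ and $k \geq 1$. The sparse eigenvalue
problem defined in (4) was written

\[
\begin{array}{rll}
\lambda^k_{\max}(A) = & \mbox{max.} & x^T A x \\
& \mbox{s.t.} & \|x\|_2 \leq 1,\ \mathbf{Card}(x) \leq k
\end{array}
\]

in the variable $x \in \mathbb{R}^p$. We then have

\[
\frac{k}{p}\, \mu(k,p)\, \lambda_{\max}(A) \leq \lambda^k_{\max}(A) \leq SDP_k(A) \leq \lambda_{\max}(A)
\]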



where 



Hik,p) 



p- 



k^ 



1 



(9) 
(10) 



whenever $k > p^{1/3}$, for $p$ sufficiently large, where $SDP_k(A)$ is the optimal value of (5).

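Proof. To maintain the parallel with Feige and Seltser (1997), we write $Z = x x^T$, where $x$ is a
leading eigenvector of $A$. The matrix $Z$ then satisfies $\mathbf{Tr}\, Z = 1$ and $Z \succeq 0$. The upper bound
in (9) follows directly from d'Aspremont et al. (2007) and we focus here on the lower bound. We
randomly sample vectors $z \in \mathbb{R}^p$ such that

\[
z_i = \left\{ \begin{array}{ll}
1/\sqrt{k} & \mbox{with probability } p_i = k \sqrt{Z_{ii}} / S, \\
0 & \mbox{otherwise,}
\end{array} \right.
\]

where $S = \sum_{i=1}^p \sqrt{Z_{ii}}$. We then have

\[
\begin{array}{rcl}
\mathbf{E}[z^T A z] & = & \mathbf{Tr}(A\, \mathbf{E}[z z^T]) \\
& = & \displaystyle \sum_{i,j=1}^p k A_{ij} \sqrt{Z_{ii} Z_{jj}} / S^2 \\
& \geq & \displaystyle \sum_{i,j=1}^p k A_{ij} \sqrt{Z_{ii} Z_{jj}} / p \\
& \geq & \displaystyle \frac{k}{p}\, \mathbf{Tr}(AZ)
\end{array}
\]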

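where the first inequality uses $\mathbf{Tr}\, Z = 1$ and the last (Cauchy) inequality follows from the fact that
$Z \succeq 0$ and $A \geq 0$. Now, let $q = \mathbf{Prob}[z^T A z < \mathbf{E}[z^T A z]/\beta]$ for some $\beta > 1$, we have

\[
\mathbf{E}[z^T A z] \leq q\, \frac{\mathbf{E}[z^T A z]}{\beta} + (1 - q)\, \frac{\mathbf{1}^T A \mathbf{1}}{k}
\]

which means

\[
1 - q \geq \frac{\beta - 1}{\beta\, \mathbf{1}^T A \mathbf{1} / (k \mathbf{E}[z^T A z]) - 1}
\]

because $\mathbf{1}^T A \mathbf{1}/k \geq \mathbf{E}[z^T A z]$. Now, using Chernoff's inequality as in (Feige and Seltser, 1997, Lem.
4.1) produces

\[
\mathbf{Prob}\left[ \mathbf{Card}(z) - \mathbf{1}^T p > t\, \mathbf{1}^T p \right] \leq e^{-t^2 \mathbf{1}^T p / 3},
\]

where $\mathbf{1}^T p = \sum_i p_i = k$ is the sum of the sampling probabilities, so, as in (Feige and Seltser, 1997, Th. 4.1), when $k > p^{1/3}$

\[
\mathbf{Prob}\left[ \mathbf{Card}(z) > k (1 + k^{-1/3}) \right] \leq e^{-p^{1/9}/3}.
\]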

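We have

\[
\frac{\mathbf{1}^T A \mathbf{1}}{k \mathbf{E}[z^T A z]} \leq \frac{p\, \mathbf{1}^T A \mathbf{1}}{k^2\, \mathbf{Tr}\, AZ} \leq \frac{p^2 \lambda_{\max}(A)}{k^2\, \mathbf{Tr}\, AZ} = \frac{p^2}{k^2}
\]

which follows from $A \geq 0$, $\mathbf{Tr}\, AZ = \lambda_{\max}(A)$, $\mathbf{Tr}\, Z = 1$ with $Z \succeq 0$. When $k > p^{1/3}$ and $p$ is large
enough so that $p^2 e^{-p^{1/9}/3}/k^2 < 1$, we can enforce

\[
\beta > \frac{1 - e^{-p^{1/9}/3}}{1 - p^2 e^{-p^{1/9}/3}/k^2}
\]

and thus get

\[
\beta > \frac{e^{p^{1/9}/3} - 1}{e^{p^{1/9}/3} - \mathbf{1}^T A \mathbf{1} / (k \mathbf{E}[z^T A z])},
\]

which means, using the bound on $q$ derived above,

\[
1 - q \geq \frac{\beta - 1}{\beta\, \mathbf{1}^T A \mathbf{1} / (k \mathbf{E}[z^T A z]) - 1} > e^{-p^{1/9}/3}
\]

which, combined with the deviation bounds detailed above, yields

\[
\mathbf{Prob}\left[ z^T A z \geq \mathbf{E}[z^T A z]/\beta \right] = 1 - q > e^{-p^{1/9}/3} \geq \mathbf{Prob}\left[ \mathbf{Card}(z) > k (1 + k^{-1/3}) \right].
\]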

This shows that by sampling enough points $z$, we can generate a vector $z_0 \in \mathbb{R}^p$ such that

\[
z_0^T A z_0 \geq \frac{k}{p \beta}\, \mathbf{Tr}(AZ) \quad \mbox{and} \quad \mathbf{Card}(z_0) \leq k (1 + k^{-1/3}).
\]

If we remove at most $k^{2/3}$ variables from $z_0$ using the backward greedy algorithm described in the
previous section, (8) shows that we lose at most a factor



kik-1) 



2 3 



(fe + P/3)(fe + fc2/3_l) ^2/3 
and, when p is large enough, we obtain a point z^ such that 



1 



^2/3 



zlAzk > - ( 1 
p 



2 \ Tr{AZ) 



fcl/3 



/3 



l-Zfclb ^ 1 and Card(zjfc) < k, 



which means that $z_k$ is a feasible point of problem (4), and yields the desired result. ■

Note that the randomization procedure detailed in the proof above is simpler than the one we
used in Section 3.2; producing bounds on the performance of the latter is unfortunately much
harder. We can directly extend this last proposition to problems where $A$ has negative coefficients,
but the bound is not proportional in this case.

Proposition 4 Let $A \in \mathbf{S}_p$ and $k \geq 1$. We have

\[
\min\{0, \min_{ij} A_{ij}\}\, k + \frac{k}{p}\, \mu(k,p)\, \lambda_{\max}(A) \leq \lambda^k_{\max}(A) \leq SDP_k(A) \leq \lambda_{\max}(A), \tag{11}
\]

where $SDP_k(A)$ is the optimal value of (5) and $\mu(k,p)$ is defined in (10), whenever $k > p^{1/3}$ and
$p$ is sufficiently large.

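Proof. The function $\lambda^k_{\max}(\cdot)$ defined in (4) is convex as a pointwise maximum of affine functions.
This implies

\[
\lambda^k_{\max}(A) \geq \lambda^k_{\max}(A - \min_{ij} A_{ij}\, \mathbf{1}\mathbf{1}^T) + \min_{ij} A_{ij}\, (\mathbf{1}^T x)^2
\]

for some vector $x$ satisfying $\|x\|_2 = 1$ and $\mathbf{Card}(x) \leq k$. The matrix $A - \min_{ij} A_{ij}\, \mathbf{1}\mathbf{1}^T$ is
nonnegative and Proposition 3 shows that

\[
\lambda^k_{\max}(A - \min_{ij} A_{ij}\, \mathbf{1}\mathbf{1}^T) \geq \frac{k}{p}\, \mu(k,p)\, \lambda_{\max}(A - \min_{ij} A_{ij}\, \mathbf{1}\mathbf{1}^T).
\]

We then get

\[
\begin{array}{l}
\displaystyle \frac{k}{p}\, \mu(k,p)\, \lambda_{\max}(A - \min_{ij} A_{ij}\, \mathbf{1}\mathbf{1}^T) + \min_{ij} A_{ij}\, (\mathbf{1}^T x)^2 \\
\qquad \geq \displaystyle \frac{k}{p}\, \mu(k,p) \left( \lambda_{\max}(A) - \min_{ij} A_{ij}\, (\mathbf{1}^T y)^2 \right) + \min_{ij} A_{ij}\, (\mathbf{1}^T x)^2 \\
\qquad \geq \displaystyle \frac{k}{p}\, \mu(k,p)\, \lambda_{\max}(A) + \min_{ij} A_{ij}\, (\mathbf{1}^T x)^2
\end{array}
\]

when $\min_{ij} A_{ij} < 0$, which follows from the convexity of $\lambda_{\max}(\cdot)$, where $y$ is a leading eigenvector
of $A - \min_{ij} A_{ij}\, \mathbf{1}\mathbf{1}^T$. We conclude using $(\mathbf{1}^T x)^2 \leq \|x\|_1^2 \leq \mathbf{Card}(x) \|x\|_2^2 \leq k$. ■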

5 Convex Minimization Algorithm 

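The relaxation in Section 2 meant solving

\[
\begin{array}{ll}
\mbox{maximize} & \mathbf{Tr}\, MZ \\
\mbox{subject to} & \|Z\|_1 \leq k \\
& \mathbf{Tr}(Z) = 1,\ Z \succeq 0
\end{array}
\]

in the variable $Z \in \mathbf{S}_p$, where $M \in \mathbf{S}_p$ was formed as $M = X^T (y y^T - \rho I) X$. We compute the dual
of this problem by first writing it in a saddle-point format

\[
\min_{\lambda \geq 0}\ \max_{\mathbf{Tr}(Z) = 1,\ Z \succeq 0}\ \mathbf{Tr}\, MZ + \lambda (k - \|Z\|_1)
\]

which is also

\[
\min_{\lambda \geq 0}\ \max_{\mathbf{Tr}(Z) = 1,\ Z \succeq 0}\ \min_{\|Y\|_\infty \leq 1}\ \mathbf{Tr}\, Z(M + \lambda Y) + k \lambda
\]

in the variables $Z, Y \in \mathbf{S}_p$. We can rewrite this as

\[
\min_{Y \in \mathbf{S}_p}\ \max_{\mathbf{Tr}(Z) = 1,\ Z \succeq 0}\ \mathbf{Tr}\, Z(M + Y) + k \|Y\|_\infty
\]

which is equivalent to

\[
\min\ \lambda_{\max}(M + Y) + k \|Y\|_\infty \tag{12}
\]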



in the variable $Y \in \mathbf{S}_p$. This is a maximum eigenvalue minimization problem and can be solved
efficiently using, for example, smooth first-order algorithms as in Nesterov (2003). Given an a priori
bound on suboptimality, the total complexity of obtaining a solution up to accuracy $\epsilon$ then grows
as 



Given an approximate solution $Y \in \mathbf{S}_p$ to the dual, we can reconstruct a corresponding primal
solution $Z$ by first solving

\[
Z = \mathop{\mathrm{argmax}}_{\mathbf{Tr}(Z) = 1,\ Z \succeq 0}\ \mathbf{Tr}\, Z(M + Y)
\]

and checking if $\|Z\|_1 \leq k$ (this last condition will always be satisfied if $Y$ is optimal).
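The dual objective in (12) and the primal reconstruction step are both cheap to evaluate for a given $Y$. The sketch below assumes that $\|Y\|_\infty$ denotes the largest absolute coefficient of $Y$ (the dual of the sum-of-absolute-values norm $\|\cdot\|_1$ from the Notations), and takes $Z = xx^T$ with $x$ a leading unit eigenvector of $M + Y$, which is a maximizer when the leading eigenvalue is simple. The function names are introduced here for illustration; no first-order solver is included.

```python
import numpy as np

def dual_objective(M, Y, k):
    """lambda_max(M + Y) + k * ||Y||_inf, with ||.||_inf the largest |coefficient|."""
    return float(np.linalg.eigvalsh(M + Y)[-1] + k * np.abs(Y).max())

def primal_from_dual(M, Y):
    """Z = x x^T for a leading unit eigenvector x of M + Y (trace one, PSD)."""
    _, vecs = np.linalg.eigh(M + Y)
    x = vecs[:, -1]
    return np.outer(x, x)

def is_primal_feasible(Z, k, tol=1e-9):
    """Check the cardinality-type constraint ||Z||_1 <= k on the reconstructed point."""
    return bool(np.abs(Z).sum() <= k + tol)
```

By weak duality, `dual_objective(M, Y, k)` is an upper bound on $\mathbf{Tr}\, MZ$ for any $Z$ feasible in the relaxation, for every symmetric `Y`, so even a suboptimal dual point yields a valid bound.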

6 Numerical Results 

Table 1 presents numerical experiments using branch-and-bound on a set of small artificial problems.
We generate normally distributed matrices $X \in \mathbb{R}^{n \times p}$, a random sparse vector $w$ whose cardinality
is at most $k$, and a righthand side vector $y$, which is equal to $Xw + e$, where $e \in \mathbb{R}^n$ is noise. The
last four columns are related to the performance of the B&B algorithm: the first gives the smallest
number of nodes visited by the algorithm, the second provides the average number of nodes visited
over all instances, the third shows the number of nodes in the complete enumeration tree while
the fourth lists the average speedup. These results suggest that the lower bound obtained in this
paper is effective in fathoming a significant number of nodes in the search tree. Out of these 160
small test instances, the forward greedy algorithm found the optimal solution for 105 problems,
whereas the randomization algorithm followed by a greedy improvement step (which will be referred
to as the enhanced randomization algorithm from now on) was able to find the optimal solution for
113 problems. Unfortunately, the authors of Moghaddam et al. (2008) did not release a software
package and the "leaps and bounds" package released by the authors of Furnival and Wilson Jr
(2000) does not output the number of nodes it visits, so direct comparisons were not possible.



Table 1: Number of nodes visited by the branch-and-bound algorithm.

 p    n    k    No. instances    B&B (Best)    B&B (Average)    Enumeration tree    Speedup (Avg.)
20   10    2         100               35             194                  380                 2
30   15    3          50              330           4 799               24 360                 5
40   20    4          10           42 236          98 236            2 193 360                22
50   25    4           2           71 552          96 734            5 527 200                57

On larger instances where $p = 100$ and $n = 50$, the cardinality of $w$ was set to 2 and 4. Figure 1
plots lower bounds (Low. Bnd.) on (1) generated by solving relaxation (5), the coarse solution
points (Primal) extracted from the matrix $Z$ solving (5), the solutions (Greedy) obtained by the
forward greedy algorithm, the LARS algorithm (Efron et al. (2004)), and the enhanced solutions
(Rand) obtained by applying the randomization algorithm detailed in Section 3.2 to the matrix
$Z$ solving (5). We observe that around the true cardinality of $w$ used in generating the problem
instances, the enhanced randomization sometimes outperforms both the forward greedy algorithm and
LARS and always performs at least as well as the best of these two methods.




More realistic data sets were generated with an image compression setting in mind. $X \in \mathbb{R}^{n \times p}$
is now an overcomplete dictionary of Gabor wavelets, and $y$ is an image patch of size $r \times r$ obtained
from an actual image. We set $r = 10$ for all the experiments. We first solve this batch of problems
(for $p = 24$ and $n = 16$) with the B&B algorithm where the target cardinality is either 2, 3, or 4. We
then compare the performance of the forward greedy algorithm and the enhanced randomization
algorithm of Section 3.2. Table 2 shows that the modified randomization algorithm finds the optimal
solution in most cases.

Table 2: Number of instances solved by greedy and randomization algorithms on image data.

     Dimensions                        Greedy                       Randomization
 p    n    k    No. instances    No. solved    Max. Rel. Gap    No. solved    Max. Rel. Gap
24   16    2         10               9             0.22              9             0.90
24   16    3         10               8             0.70              9             0.16
24   16    4         10               8             0.94              9             0.31



Most of our experiments so far were focused on finding exact solutions to small instances of
problem (1). We also tested the numerical complexity of our methods on larger problems for which
we only sought good upper and lower bounds. Computing times for solving relaxation (5) on
increasingly large Gaussian random problems (generated as above) are reported in Table 3.

Table 3: CPU time versus problem size.

Problem size p    CPU time
      100         0 h 00 m 07 s
      250         0 h 01 m 32 s
      500         0 h 10 m 19 s
     1000         1 h 22 m 59 s



Acknowledgments 

The last author would like to acknowledge partial support from NSF grants SES-0835550 (CDI), 
CMMI-0844795 (CAREER), CMMI-0968842, a Peek junior faculty fellowship, a Howard B. Wentz 
Jr. award and a gift from Google. 

References 

E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals
of Statistics, 35(6):2313-2351, 2007.

A. d'Aspremont, L. El Ghaoui, M.I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA
using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis.
Journal of Machine Learning Research, 9:1269-1294, 2008.

D.L. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings
of the National Academy of Sciences, 102(27):9452-9457, 2005.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):
407-499, 2004.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries.
IEEE Transactions on Image Processing, 15(12):3736-3745, 2006.

U. Feige and M. Langberg. Approximation algorithms for maximization problems arising in graph
partitioning. Journal of Algorithms, 41(2):174-211, 2001.






U. Feige and M. Seltser. On the densest k-subgraph problem. Technical report, Department of Applied
Mathematics and Computer Science, The Weizmann Institute, 1997.

U. Feige, D. Peleg, and G. Kortsarz. The dense k-subgraph problem. Algorithmica, 29(3):410-421, 2001.

G.M. Furnival and R.W. Wilson Jr. Regressions by leaps and bounds. Technometrics, 42(1):69-79, 2000.

M.X. Goemans and D.P. Williamson. Improved approximation algorithms for maximum cut and satisfiability
problems using semidefinite programming. J. ACM, 42:1115-1145, 1995.

DJ Hand. Branch and bound in statistical data analysis. The Statistician, pages 1-13, 1981.

R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

G. Kortsarz and D. Peleg. On choosing a dense subgraph. In Foundations of Computer Science, 1993.
Proceedings., 34th Annual Symposium on, pages 692-701, 1993.

J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration.
SIAM Multiscale Modeling and Simulation, 7(1):214-241, 2008.

N. Meinshausen, G. Rocha, and B. Yu. A tale of three cousins: Lasso, L2Boosting, and Dantzig. Annals of
Statistics, 35(6):2373-2384, 2007.

B. Moghaddam, A. Gruber, Y. Weiss, and S. Avidan. Sparse regression as a sparse eigenvalue problem. In
Information Theory and Applications Workshop, 2008, pages 121-127, 2008.

PM Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE
Transactions on Computers, 100(26):917-922, 1977.

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227-234, 1995.

Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.

R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society,
Series B, 58(1):267-288, 1996.






